Non-record: 11L NativeFlowMatcher + Legal TTT — val_bpb 1.1199 (3-seed mean no-TTT: 1.1225)#1170

Open
Christopher-Lee-McClendon wants to merge 4 commits into openai:main from Christopher-Lee-McClendon:submission/11L-nativeflow-legal-ttt

Conversation

@Christopher-Lee-McClendon Christopher-Lee-McClendon commented Mar 31, 2026

Summary

Non-record submission exploring NativeFlowMatcher (NFM) — a 393K-parameter OT-CFM (Optimal Transport Conditional Flow Matching) velocity network that applies gated hidden-state correction to transformer hidden states, jointly trained with the AR objective. The Flow Matching module is trained as distribution transport, but used at inference as a small residual correction.
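The inference-time behavior described above (a gated Euler step applied as a residual correction to hidden states) can be sketched roughly as follows. This is a minimal NumPy illustration with randomly initialized stand-in weights, not the submission's code; the 512-dim hidden size and 256-dim velocity net follow the architecture section below, and every variable name is hypothetical:

```python
import numpy as np

rng = np.random.default_rng(0)
D, H = 512, 256  # transformer hidden size, velocity-net hidden size

# Hypothetical velocity-network parameters (jointly trained with the AR loss).
W_in = rng.normal(0, 0.02, (D, H))
W_out = rng.normal(0, 0.02, (H, D))
w_gate = np.zeros(D)  # per-channel learned gate parameters

def sinusoidal_time_embed(t, dim=H):
    """Standard sinusoidal embedding of the flow time t in [0, 1]."""
    freqs = np.exp(-np.log(10000.0) * np.arange(dim // 2) / (dim // 2))
    return np.concatenate([np.sin(t * freqs), np.cos(t * freqs)])

def nfm_correct(h, t=1.0):
    """Gated Euler step at t=1: h' = h + sigmoid(gate) * v(h, t)."""
    v = np.tanh(h @ W_in + sinusoidal_time_embed(t)) @ W_out  # velocity field
    gate = 1.0 / (1.0 + np.exp(-w_gate))                      # per-channel gate
    return h + gate * v

h = rng.normal(size=(4, D))  # a batch of hidden states
h2 = nfm_correct(h)
print(h2.shape)              # (4, 512)
```

The point of the gating is that the flow-matching module, trained as distribution transport, only ever nudges the hidden state rather than replacing it.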

Results

Three-seed reproducibility (training-time sliding window, no TTT):

| Seed | SLURM Job | Training val_bpb | Sliding BPB (no TTT) | Artifact Bytes |
| --- | --- | --- | --- | --- |
| 42 | 55342820 | 1.1380 | 1.12312 | 15,745,776 |
| 1337 | 55398556 | 1.1385 | 1.12367 | 15,736,933 |
| 2025 | 55398557 | 1.1359 | 1.12077 | 15,745,950 |
| Mean ± Std | | 1.1375 ± 0.0014 | 1.12252 ± 0.00151 | |

Primary (seed=42, with legal TTT):

| Evaluation | val_bpb |
| --- | --- |
| Sliding window (stride=64), no TTT | 1.12312 |
| Sliding window (stride=64), legal TTT | 1.11991 |

Legal TTT gain: −0.00321 BPB

Legal TTT evaluation for seeds 1337 and 2025 is pending (SLURM jobs 55411651–55411654).
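For readers unfamiliar with the evaluation protocol: stride-64 sliding-window scoring means every token is scored exactly once, but always with a long left context, because windows overlap and only the trailing stride of each window is scored. A minimal illustration (not the repo's eval code; a toy window of 256 is used, and "bpb" here treats tokens as bytes for simplicity):

```python
import numpy as np

def sliding_window_bpb(tokens, log2_prob_fn, window=256, stride=64):
    """Bits per token with a sliding window: windows of length `window`
    advance by `stride`; each window scores only its last `stride` positions
    (the first window scores everything), so every token is scored exactly
    once with at least `window - stride` tokens of left context."""
    total_bits, n_scored, start = 0.0, 0, 0
    while True:
        end = min(start + window, len(tokens))
        score_from = 0 if start == 0 else start + window - stride
        for t in range(score_from, end):
            total_bits -= log2_prob_fn(tokens[start:t], tokens[t])
            n_scored += 1
        if end >= len(tokens):
            break
        start += stride
    return total_bits / n_scored

# Sanity check with a uniform model over 256 byte values: exactly 8 bits/byte.
tokens = list(np.random.default_rng(0).integers(0, 256, size=1000))
bpb = sliding_window_bpb(tokens, lambda ctx, tgt: -np.log2(256))
print(bpb)  # 8.0
```

A smaller stride trades compute for more context per scored token, which is why stride=64 scores are lower than single-pass evaluation.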

Architecture

  • 11L/512D/GQA(8H/4KV), 3×MLP, 27.5M params total
  • NativeFlowMatcher: 256-dim hidden velocity network with sinusoidal time conditioning, gated Euler step at t=1
  • XSA on all 11 layers, BigramHash(4096,128), LeakyReLU(0.5)², value residual, gated attention
  • Mixed int6/int5 quantization + zstd-16 compression
  • Artifact: 15,745,776 bytes (254K headroom under 16MB cap)
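The low-bit artifact pipeline in the last two bullets can be illustrated with a minimal sketch of symmetric per-tensor quantization followed by entropy coding. This is an assumption-laden toy, not the submission's packing code: zlib stands in for zstd-16 (zstd is not in the Python stdlib used here), and per-channel scales, int5 groups, and bit-packing are omitted:

```python
import zlib  # stand-in for zstd level 16
import numpy as np

def quantize_symmetric(w, bits):
    """Symmetric per-tensor quantization to signed `bits`-bit integers."""
    qmax = 2 ** (bits - 1) - 1                     # 31 for int6, 15 for int5
    scale = float(np.abs(w).max()) / qmax
    q = np.clip(np.round(w / scale), -qmax, qmax).astype(np.int8)
    return q, scale

rng = np.random.default_rng(0)
w = rng.normal(0, 0.02, size=(512, 512)).astype(np.float32)

q, scale = quantize_symmetric(w, bits=6)
w_hat = q.astype(np.float32) * scale               # dequantized weights

raw = q.tobytes()
packed = zlib.compress(raw, 9)                     # entropy-code the 6-bit ints
print(len(packed) / len(raw))                      # < 1: low-bit ints compress
```

Since int6 values occupy only 63 of 256 byte codes (and are roughly Gaussian), the compressor recovers most of the unused bits, which is what makes a 27.5M-parameter model fit in a 16MB artifact.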

Training

  • 7,000 steps on 1×A100 PCIe 40GB, ~3.86 hours per seed
  • Muon + Adam optimizer, 2048 sequence length
  • Three seeds completed: 42, 1337, 2025

Ablation Studies

2×2 Matrix: NFM × TTT (isolating NFM contribution):

| Configuration | Params | No TTT (BPB) | Legal TTT (BPB) | Δ (TTT effect) |
| --- | --- | --- | --- | --- |
| Base (no NFM) | 27,137,223 | 1.12087 | pending | pending |
| NFM (hd=256, lw=0.1) | 27,530,952 | 1.12312 | 1.11991 | −0.00321 |
| Δ (NFM effect) | +393,729 | +0.00225 | pending | |

Base retraining is running. Loss weight sweep (lw=0.01, 0.05, 0.20) and hidden dim sweep (hd=128, 512) are queued.

Supplementary: E2E TTT + FlowRefiner 7k eval completed: legal TTT BPB = 1.12418.

Limitations

  • Three-seed reproducibility achieved (no-TTT): Mean sliding BPB = 1.12252 ± 0.00151. Legal TTT eval pending for seeds 1337, 2025.
  • Non-record — This submission documents the NFM idea and its interaction with legal TTT. It is unclear whether NFM justifies its extra compute cost (roughly 10 min of training and 10 min of eval). The number of training steps was chosen to match comparable base models without NFM.
  • NFM adds +0.00225 BPB vs matched base (no NFM) at 7k steps — the extra 393K params do not improve val_bpb. The idea may be more relevant at longer training schedules or combined with other techniques.

Credits

Base architecture (PR #549, @abaybektursun), Muon (baseline), BigramHash/SmearGate (PR #65, @aquariouserworkman), XSA (PR #187/#265, @Idan3011/@unnir), mixed quant (PR #76), sliding window (PR #50, @mattqlf), legal TTT (PR #77, @samacqua; PR #461, @Christopher-Lee-McClendon), VE/PartialRoPE/LN Scale (PR #315/#374, @jfprincz/@unnir), gated attention/value residual (PR #940), EMA (PR #65, @aquariouserworkman)

Checklist

  • Single training script (train_gpt.py) — self-contained
  • No n-gram cache
  • Legal TTT: score-first, no training on unscored tokens
  • 16MB artifact budget: 15,745,776 bytes
  • README with architecture details, results, provenance
  • submission.json with metadata
  • train.log with training trajectory
  • Three-seed reproducibility (seeds 42, 1337, 2025)
  • Supplementary eval logs and SLURM scripts for all seeds
  • Ablation studies (NFM × TTT matrix, sweep jobs submitted)

- NativeFlowMatcher: 393K-param OT-CFM velocity network with gated hidden-state correction
- Legal score-first TTT: SGD lr=0.002, 10 epochs, freeze_blocks=2
- val_bpb: 1.11991 (sliding window stride=64, legal TTT)
- val_bpb: 1.12312 (sliding window stride=64, no TTT)
- Artifact: 15,745,776 bytes (254K headroom)
- Single-seed (42) exploratory submission
- Supplementary: eval logs, SLURM scripts, comparison data
- 2×2 matrix: NFM × TTT with base no-TTT baseline (1.12087)
- Loss weight sweep: 0.01, 0.05, 0.1, 0.2
- Hidden dim sweep: 128, 256, 512
- 13 SLURM jobs submitted (6 train + 7 eval)
- Results pending, will update when jobs complete
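The score-first TTT constraint listed above (score a span before any gradient update touches it) can be illustrated with a toy model. This is an illustrative sketch with a learned unigram model, not the submission's TTT code; the chunk size, learning rate, and epoch count here are arbitrary:

```python
import numpy as np

def score_first_ttt(tokens, vocab=16, chunk=8, lr=0.5, epochs=2):
    """Legal TTT: for each chunk, (1) score it with the model as it stood
    before seeing the chunk, (2) only then take gradient steps on it. No
    token is ever trained on before it has been scored."""
    logits = np.zeros(vocab)                  # toy unigram model
    total_bits = 0.0
    for start in range(0, len(tokens), chunk):
        x = tokens[start:start + chunk]
        # --- score first ---
        p = np.exp(logits - logits.max()); p /= p.sum()
        total_bits -= np.log2(p[x]).sum()
        # --- then adapt on the already-scored chunk ---
        for _ in range(epochs):
            p = np.exp(logits - logits.max()); p /= p.sum()
            grad = len(x) * p - np.bincount(x, minlength=vocab)  # d(NLL)/dlogits
            logits -= lr * grad
    return total_bits / len(tokens)

rng = np.random.default_rng(0)
tokens = rng.integers(0, 4, size=256)  # skewed stream: only 4 of 16 symbols
print(score_first_ttt(tokens))         # below the 4-bit uniform baseline
```

Adaptation helps (the model learns the stream's skew as it goes) while every probability remains a function only of earlier tokens, which is the compliance condition discussed further down in this thread.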
@Christopher-Lee-McClendon
Author

Ablation Studies Submitted

13 SLURM jobs have been submitted to run comprehensive ablation studies for this NFM submission:

2×2 Matrix: NFM × Legal TTT

Isolating the individual contributions of NFM and legal TTT at matched 7k steps.

| Configuration | Params | No TTT (BPB) | Legal TTT (BPB) |
| --- | --- | --- | --- |
| Base (no NFM) | 27,137,223 | 1.12087 ✅ | pending (→55398695) |
| NFM (hd=256, lw=0.1) | 27,530,952 | 1.12312 ✅ | 1.11991 |

NFM Hyperparameter Sweeps

Loss weight sweep (hidden_dim=256, seed=42):

  • lw=0.01 → jobs 55398696→55398699
  • lw=0.05 → jobs 55398697→55398700
  • lw=0.10 (default) → 1.12312 ✅
  • lw=0.20 → jobs 55398698→55398701

Hidden dim sweep (loss_weight=0.1, seed=42):

  • hd=128 → jobs 55398702→55398704
  • hd=256 (default) → 1.12312 ✅
  • hd=512 → jobs 55398703→55398705

Also pending

  • 3-seed reproducibility runs (seeds 1337, 2025): jobs 55398556–55398561
  • E2E TTT+Flow 7k reeval with 5h wallclock: job 55398555

Results will be updated in README as jobs complete.

- Training completed for seeds 42, 1337, 2025 (all 7k steps)
- 3-seed mean sliding BPB (no TTT): 1.12252 ± 0.00151
- Seed 42: 1.12312, Seed 1337: 1.12367, Seed 2025: 1.12077
- Legal TTT eval jobs submitted (SLURM 55411651-55411654)
- Added completed E2E TTT+Flow eval log (SLURM 55398555, BPB=1.12418)
- Added training logs and SLURM scripts for all seed runs
- Updated README with 3-seed results table and training trajectories
- Updated submission.json with per-seed metrics and job IDs
@Christopher-Lee-McClendon Christopher-Lee-McClendon changed the title Non-record: 11L NativeFlowMatcher + Legal TTT — val_bpb 1.1199 (single seed) Non-record: 11L NativeFlowMatcher + Legal TTT — val_bpb 1.1199 (3-seed mean no-TTT: 1.1225) Apr 1, 2026
- Three-seed legal TTT: mean 1.11928 ± 0.00146 (seeds 42, 1337, 2025)
- 2×2 NFM×TTT matrix complete: NFM hurts by +0.002 (no-TTT) / +0.001 (TTT)
- Loss weight sweep: lw=0.05 best but still +0.002 worse than base
- Hidden dim sweep: hd=512 best but still +0.001 worse than base
- Updated limitations section to reflect negative result conclusion
@MatoTeziTanka

Community Review — Non-record: 11L NativeFlowMatcher + Legal TTT — val_bpb 1.1199 (3-seed mean no-TTT: 1.1225)

BPB: 1.1199 | Compliance: FLAG — hashed n-gram cache with target-in-key (PR #779 family pattern)

What I found in the code (head SHA 039f4e891256, file records/track_non_record_16mb/2026-03-31_11L_NativeFlowMatcher_LegalTTT/train_gpt.py):

The n-gram lookup key at line 1475 is constructed by XOR-ing the target token into the hash:

line 1475: full_key = <hash> ^ (tgt_np * ng_primes[...]) & mask

This matches the full_key = ((ctx_hash ^ (target * primes[k])) & mask) construction that @valerio-oai ruled disallowed on PR #779 (comment 4145781641, 2026-03-27). Per the mechanism explanation, hashing the target token into the lookup key only reweights the correct token — in the hash-collision limit this drives P(correct) → 1 regardless of the data, which inflates the reported BPB without producing real compression.

Per Issue #1017 condition 1, p_t may depend only on the artifact and x_1...x_{t-1}. Because the lookup key at line 1475 is a function of the target token, the count read at scoring position t depends on x_t itself — which is the core violation the #779 ruling targets.
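The inflation mechanism described in the ruling can be reproduced in a few lines. This is a toy reconstruction of the disallowed pattern, not the PR's actual code; the table size, prime, and boost probability are all illustrative. Because the true target enters the lookup key, once hash collisions fill the small table, the "cache hit" lands on the correct token at nearly every position, even on incompressible data:

```python
import numpy as np

MASK, PRIME = (1 << 8) - 1, 31337           # tiny table -> heavy collisions
VOCAB = 1024

rng = np.random.default_rng(0)
tokens = rng.integers(0, VOCAB, size=4000)  # random data: 10 bits/token

counts = np.zeros(MASK + 1)
ctx, bits = 0, 0.0
for t in range(1, len(tokens)):
    ctx = (ctx * 1000003 + int(tokens[t - 1])) & MASK   # rolling context hash
    # ILLEGAL: the true target x_t enters the key used to score position t.
    full_key = (ctx ^ (int(tokens[t]) * PRIME)) & MASK
    if counts[full_key] > 0:   # almost always true once buckets fill
        p_correct = 0.99       # the boost is addressed *via the target itself*
    else:
        p_correct = 1.0 / VOCAB
    bits -= np.log2(p_correct)
    counts[full_key] += 1

print(bits / (len(tokens) - 1))  # far below 10 bits despite random data
```

A context-only key cannot do this: probing key(ctx, c) for every candidate c yields a legitimate (if noisy) n-gram distribution, which is the legal path suggested on #779.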

Cluster context: this same structural pattern has been closed on 15+ PRs under the #779 ruling as of 2026-04-11 (#779 itself, #770, #798, #808, #825, #786, #797, #909, #940, #761, #776, #788, #774, #778, #715, #758, #702 upstream, #1488). The base neural model is unaffected by this flag — in every case where the authors resubmitted without the n-gram cache, the base val_bpb has been in the ~1.10-1.15 range (standard for the SP1024 11L class).

CPU smoke test (CT2038 proteus-engine, 2026-04-11): import OK in 0.06s, dim=512, layers=11, vocab=1024, code=115032 B, SMOKE_TEST_PASS

Verdict: COMPLIANCE FLAG — target-in-key hashed n-gram cache, same family as PR #779.

Recommendation to @cocohearts @valerio-oai @0hq @yuzhougu-oai @notapplica: CLOSE under the same ruling as the rest of the family-bug cluster. A context-only resubmission (drop the target from the lookup key and use a full-vocabulary reweighting from a single context row, per @valerio-oai's suggested legal path on #779) would be welcomed.


Reviewed by @MatoTeziTanka (The Agora). Classification via deterministic AST-based classify_prs.py (pattern bank derived from ~65 manually-reviewed PRs earlier in the 2026-04-11 sweep). This review was auto-drafted from a template and spot-checked before posting; if the template misread your code, please call it out so I can iterate the classifier.
